System Design Refresher
Table of Contents
1. Networking & Communication
- IP Addressing
- DNS (Domain Name System)
- Load Balancers
- L4 vs L7
- Proxy
- Forward Proxy
- Reverse Proxy
- TCP vs UDP
- HTTP/HTTPS Basics
- HTTP Methods
- Important Headers
- HTTP Status Codes
- REST vs GraphQL vs gRPC
- Real-Time Communication
- WebSockets
- Long Polling
- Server-Sent Events (SSE)
- CDN (Content Delivery Network)
2. Storage & Databases
- Relational Databases (SQL)
- ACID Properties
- Isolation Levels
- Normalization
- Database Replication Patterns
- Master-Slave (Primary-Replica)
- Master-Master (Multi-Master)
- NoSQL Databases
- Key-Value Stores
- Document Databases
- Wide-Column Stores
- Graph Databases
- Database Indexes
- B-Tree Indexes
- Hash Indexes
- Inverted Indexes
- Geospatial Indexes
- Replication & Sharding
- Horizontal Scaling (Sharding)
- Vertical Scaling
- CAP Theorem
- Query Optimization
- CDC (Change Data Capture)
- Full-Text Search
- Caching Strategies
- Cache Levels
- Cache Patterns
- Cache Eviction Policies
- Cache Stampede
3. Scalability & Reliability
- Load Balancing Algorithms
- Round Robin
- Least Connections
- Consistent Hashing
- IP Hash
- Least Response Time
- Rate Limiting & Throttling
- Algorithms (Token Bucket, Leaky Bucket, etc.)
- Message Queues & Streams
- Kafka
- RabbitMQ
- AWS SQS
- Backpressure
- Consumer Groups
- Leader Election
- Raft
- Paxos
- ZooKeeper
- Failover & Redundancy
- Active-Passive
- Active-Active
4. System Design Patterns
- Read vs Write-Heavy Systems
- CQRS (Command Query Responsibility Segregation)
- Event Sourcing
- Caching Patterns (Revisited)
- Write-Through
- Write-Back
- Write-Around
- Idempotency & Retries
- Consistency Models
- Strong Consistency
- Eventual Consistency
- Causal Consistency
- Read-Your-Writes Consistency
- Monotonic Reads
5. Advanced Caching
- CDN Deep Dive
- Push CDN
- Pull CDN
- Edge Computing
- Redis Advanced Patterns
- Application-Level Caching
- Cache Invalidation Strategies
6. Observability
- Backoff, Jitter, and Retry Strategies
- Exponential Backoff
- Jitter
- Circuit Breaker Pattern
- Logging Best Practices
- Structured Logging
- Log Levels
- Correlation IDs
- Monitoring & Metrics
- Key Metrics (RED Method)
- USE Method
- Golden Signals
- Metric Types
- Prometheus & Grafana
- Alerting Best Practices
- Distributed Tracing
- Jaeger
- OpenTelemetry
- SLO, SLI, SLA
- Error Budgets
7. Security & Privacy
- Authentication vs Authorization
- Authentication Mechanisms
- JWT
- OAuth 2.0
- SSO
- Session-Based Authentication
- Encryption
- TLS/HTTPS
- Data at Rest
- Hashing vs Encryption
- DDoS Protection & WAF
- Data Privacy (GDPR Basics)
8. Infrastructure & Deployment
- Containers & Orchestration
- Docker
- Kubernetes
- CI/CD Pipelines
- Continuous Integration
- Continuous Deployment
- Deployment Strategies
- Service Discovery
- API Gateway
- Microservices vs Monoliths
9. Special Topics
- Search Systems
- Inverted Index
- Ranking Algorithms
- Elasticsearch Architecture
- Bloom Filters
- Recommendation Systems
- Collaborative Filtering
- Content-Based Filtering
- Distributed Transactions
- Two-Phase Commit (2PC)
- Saga Pattern
- Consensus Algorithms
- Raft
- Paxos
- Time & Ordering in Distributed Systems
- Lamport Clocks
- Vector Clocks
- True Time
10. Additional Important Topics
- Back-of-the-Envelope Calculations
- Polling vs Push vs Long Polling
- Database Connection Pooling
- Partitioning vs Sharding
- Webhooks
- Reverse Hash Lookup
Interview Preparation
- How to Approach System Design Interviews
- Common Mistakes to Avoid
- Key Trade-offs to Discuss
- Practice Questions
Quick Reference
- When to Use SQL vs NoSQL
- Caching Decision Tree
- Database Replication Strategy
- Message Queue vs Database
1. Networking & Communication
IP Addressing
- IPv4: 32-bit addresses (e.g., 192.168.1.1), supports ~4.3 billion addresses
- IPv6: 128-bit addresses, designed to solve IPv4 exhaustion
- Private vs Public IPs: Private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) are for internal networks; public IPs are internet-routable
- CIDR Notation: 192.168.1.0/24 means first 24 bits are network, last 8 for hosts
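The CIDR arithmetic above can be checked with Python's standard `ipaddress` module. This is a minimal sketch; the specific host addresses are illustrative only.

```python
import ipaddress

# 192.168.1.0/24: the first 24 bits identify the network, the last 8 identify hosts.
net = ipaddress.ip_network("192.168.1.0/24")

total_addresses = net.num_addresses      # 2 ** (32 - 24)
usable_hosts = total_addresses - 2       # minus the network and broadcast addresses
is_member = ipaddress.ip_address("192.168.1.42") in net
is_private = ipaddress.ip_address("10.1.2.3").is_private

print(total_addresses, usable_hosts, is_member, is_private)
```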
DNS (Domain Name System)
- Translates domain names to IP addresses
- Hierarchical system: Root → TLD (.com) → Authoritative nameserver
- Record types:
- A: Maps domain to IPv4
- AAAA: Maps to IPv6
- CNAME: Alias to another domain
- MX: Mail server
- NS: Nameserver
- DNS caching: Browsers, OS, recursive resolvers cache results (TTL-based)
- Interview tip: DNS is often a single point of failure; use multiple nameservers
Load Balancers
L4 (Layer 4 - Transport Layer)
- Operates at TCP/UDP level
- Routes based on IP address and port
- Faster, less inspection overhead
- Cannot route based on content (URL, headers)
- Use case: High-throughput, low-latency requirements
L7 (Layer 7 - Application Layer)
- Operates at HTTP/HTTPS level
- Routes based on URLs, headers, cookies
- Can do SSL termination, content-based routing
- More CPU intensive
- Use case: Microservices with different endpoints, A/B testing
Proxy
Forward Proxy
- Client-side proxy
- Client → Forward Proxy → Internet
- Use cases:
- Content filtering in organizations
- Anonymity (VPN-like behavior)
- Caching for clients
- Example: Corporate proxy server
Reverse Proxy
- Server-side proxy
- Client → Reverse Proxy → Backend servers
- Use cases:
- Load balancing
- SSL termination
- Caching
- Security (hide backend infrastructure)
- Examples: Nginx, HAProxy
TCP vs UDP
| Feature | TCP | UDP |
|---|---|---|
| Connection | Connection-oriented (3-way handshake) | Connectionless |
| Reliability | Guaranteed delivery, ordered | No guarantee, may lose/reorder packets |
| Speed | Slower (overhead) | Faster |
| Use cases | HTTP, SSH, File transfers | Video streaming, DNS, Gaming, VoIP |
| Flow control | Yes (prevents overwhelming receiver) | No |
| Error checking | Extensive | Basic checksum |
Interview insight: TCP trades speed for reliability; UDP trades reliability for speed
HTTP/HTTPS Basics
HTTP Methods
- GET: Retrieve resource (idempotent, cacheable)
- POST: Create resource (not idempotent)
- PUT: Update/replace entire resource (idempotent)
- PATCH: Partial update (not necessarily idempotent)
- DELETE: Remove resource (idempotent)
- HEAD: Like GET but without response body
- OPTIONS: Check available methods
Important Headers
- Cache-Control: Directives for caching (max-age, no-cache, no-store)
- ETag: Resource version identifier for conditional requests
- Authorization: Bearer tokens, API keys
- Content-Type: MIME type (application/json, text/html)
- User-Agent: Client information
- Accept: Content types client can process
HTTP Status Codes
- 2xx: Success (200 OK, 201 Created, 204 No Content)
- 3xx: Redirection (301 Moved Permanently, 304 Not Modified)
- 4xx: Client errors (400 Bad Request, 401 Unauthorized, 404 Not Found, 429 Too Many Requests)
- 5xx: Server errors (500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable)
REST vs GraphQL vs gRPC
REST
- Resource-based URLs (/users/123)
- Standard HTTP methods
- Over-fetching/under-fetching possible
- Easy caching via HTTP
- Best for: Public APIs, CRUD operations
GraphQL
- Single endpoint (/graphql)
- Client specifies exact data needed
- Reduces over-fetching
- More complex server-side
- Best for: Complex data requirements, mobile apps (bandwidth concerns)
gRPC
- Uses Protocol Buffers (binary format)
- HTTP/2 based, bidirectional streaming
- Strongly typed contracts
- More efficient than JSON
- Best for: Internal microservices, high-performance requirements
Real-Time Communication
WebSockets
- Full-duplex communication over single TCP connection
- Persistent connection (after HTTP upgrade)
- Low latency, real-time bidirectional
- Use cases: Chat apps, live trading, multiplayer games
- Trade-off: Stateful, harder to scale (often needs sticky sessions or a shared pub/sub layer between servers)
Long Polling
- Client requests, server holds connection until data available
- Then responds, client immediately requests again
- More overhead than WebSockets but better compatibility
- Use cases: Real-time updates where WebSockets unavailable
Server-Sent Events (SSE)
- Server pushes updates to client over HTTP
- Unidirectional (server → client)
- Auto-reconnects, built-in event IDs
- Use cases: News feeds, stock tickers, notifications
- Trade-off: Only server-to-client (unlike WebSockets)
CDN (Content Delivery Network)
- Distributed servers at edge locations (geographically closer to users)
- Benefits:
- Reduced latency (geographic proximity)
- Lower bandwidth costs
- DDoS protection
- High availability
- Edge caching: Static assets cached at CDN edges
- Geo-replication: Content replicated across regions
- Invalidation: Can purge/update cached content
- Examples: CloudFlare, Akamai, CloudFront
2. Storage & Databases
Relational Databases (SQL)
ACID Properties
- Atomicity: All operations in transaction succeed or all fail (no partial states)
- Consistency: Database remains in valid state (constraints honored)
- Isolation: Concurrent transactions don't interfere
- Durability: Committed data persists even after crashes
Isolation Levels
- Read Uncommitted: Can read uncommitted changes (dirty reads)
- Read Committed: Only reads committed data (default in many DBs)
- Repeatable Read: Rows read once return the same values if re-read within the transaction (phantom rows may still appear in some databases)
- Serializable: Strongest isolation, transactions fully isolated
Normalization
- 1NF: Atomic values, no repeating groups
- 2NF: 1NF + no partial dependencies (all non-key attributes depend on entire primary key)
- 3NF: 2NF + no transitive dependencies
- Trade-off: More normalized = less redundancy but more joins; denormalization for read performance
Database Replication Patterns
Master-Slave (Primary-Replica)
- Write: Goes to master only
- Read: Can be served from slaves/replicas
- Pros: Scales reads, simple architecture
- Cons: Master is single point of failure for writes, replication lag
- Use case: Read-heavy workloads (90% reads, 10% writes)
Master-Master (Multi-Master)
- Write: Can go to any master
- Read: From any master
- Pros: No single point of failure for writes, better write scaling
- Cons: Complex conflict resolution, potential data inconsistencies
- Conflict resolution: Last-write-wins, version vectors, custom logic
- Use case: Globally distributed systems, high write availability
NoSQL Databases
Key-Value Stores
- Structure: Simple key → value mapping
- Examples: Redis, DynamoDB, Riak
- Pros: Extremely fast, simple, horizontally scalable
- Cons: Limited query capabilities (no joins, no complex queries)
- Use cases: Session storage, caching, shopping carts
Document Databases
- Structure: Store JSON-like documents
- Examples: MongoDB, CouchDB, Firestore
- Pros: Flexible schema, good for hierarchical data, can index/query fields
- Cons: Limited or no joins (embed or reference instead), consistency guarantees vary by product and configuration
- Use cases: Content management, user profiles, catalogs
Wide-Column Stores
- Structure: Column families, rows can have different columns
- Examples: Cassandra, HBase, BigTable
- Pros: Efficient for sparse data, scales horizontally, fast writes
- Cons: Complex data modeling, eventual consistency
- Use cases: Time-series data, IoT sensors, event logging
Graph Databases
- Structure: Nodes, edges, properties
- Examples: Neo4j, JanusGraph, Amazon Neptune
- Pros: Efficient for relationship queries, natural for connected data
- Cons: Harder to scale horizontally, specialized use cases
- Use cases: Social networks, recommendation engines, fraud detection
Database Indexes
B-Tree Indexes
- Structure: Balanced tree, sorted data
- Pros: Good for range queries, ordered data
- Operations: O(log n) for search, insert, delete
- Use case: Default index in most SQL databases
- Example:
WHERE age BETWEEN 25 AND 35
Hash Indexes
- Structure: Hash table
- Pros: O(1) for exact match lookups
- Cons: No range queries, no ordering
- Use case: Equality comparisons only
- Example:
WHERE user_id = 123
Inverted Indexes
- Structure: Maps terms to document IDs containing them
- Used in: Full-text search engines
- Example:
- Doc1: "quick brown fox"
- Doc2: "brown dog"
- Index: "brown" → [Doc1, Doc2]
- Use case: Search functionality (Elasticsearch)
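The Doc1/Doc2 example above can be reproduced with a small sketch. The function names `build_inverted_index` and `search` are hypothetical, and tokenization here is a plain whitespace split, which is far simpler than what a real search engine does.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return IDs of documents that contain every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

docs = {"Doc1": "quick brown fox", "Doc2": "brown dog"}
index = build_inverted_index(docs)
print(index["brown"])
```

A lookup touches only the posting lists for the query terms, which is why this structure scales better than scanning every document.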
Geospatial Indexes
- Types:
- R-tree: For spatial data (rectangles, polygons)
- Quadtree: Divides space into quadrants recursively
- Geohash: Encodes lat/long into string
- Use cases: "Find restaurants within 5km", location-based services
- Examples: MongoDB geospatial, PostGIS
Replication & Sharding
Horizontal Scaling (Sharding)
- Definition: Distribute data across multiple machines
- Sharding strategies:
- Hash-based: hash(key) % num_shards
- Range-based: user_id 1-1M on shard1, 1M-2M on shard2
- Geography-based: EU users on EU shard, US users on US shard
- Directory-based: Lookup table maps keys to shards
- Challenges:
- Cross-shard joins expensive
- Rebalancing shards when adding nodes
- Choosing good shard key (avoid hotspots)
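To see why rebalancing is a challenge for hash-based sharding, the sketch below counts how many keys change shard when `num_shards` goes from 4 to 5. The key format and the use of MD5 as a stable hash are assumptions made for illustration; with plain modulo placement, about four in five keys are expected to move.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Hash-based sharding: stable hash of the key, modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical keys; any reasonably uniform key set behaves similarly.
keys = [f"user:{i}" for i in range(10_000)]

# A key stays put only if hash % 4 == hash % 5, which is rare.
moved = sum(1 for key in keys if shard_for(key, 4) != shard_for(key, 5))
moved_fraction = moved / len(keys)
print(f"{moved_fraction:.0%} of keys change shard when going from 4 to 5 shards")
```

This is the motivation for consistent hashing or directory-based placement when the shard count is expected to change.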
Vertical Scaling
- Definition: Add more resources (CPU, RAM) to single machine
- Pros: Simpler (no distributed complexity)
- Cons: Limited by hardware limits, expensive, single point of failure
- When to use: Before reaching limits, for databases requiring strong consistency
CAP Theorem
You can only have 2 of 3: Consistency, Availability, Partition tolerance
- Consistency: All nodes see same data at same time
- Availability: Every request to a non-failing node gets a non-error response (though not necessarily the latest data)
- Partition tolerance: System works despite network partitions
In practice: Network partitions will happen, so choose between:
- CP (Consistency + Partition tolerance): Sacrifice availability during partition
- Examples: HBase, MongoDB (strong consistency mode)
- Use case: Financial systems, inventory
- AP (Availability + Partition tolerance): Sacrifice consistency during partition
- Examples: Cassandra, DynamoDB, Riak
- Use case: Social media feeds, analytics
PACELC Theorem (more realistic):
- If Partition, choose A or C
- Else (no partition), choose Latency or Consistency
Query Optimization
Techniques
- Indexes: Most critical (but adds write overhead)
- Query analysis: Use EXPLAIN to see execution plan
- Avoid SELECT *: Fetch only needed columns
- Limit result sets: Pagination, WHERE clauses
- Denormalization: For read-heavy workloads
- Partitioning: Split large tables
- Connection pooling: Reuse database connections
- Caching: Redis for frequently accessed data
Common Issues
- N+1 queries: Fetching related data in loop (use joins or batch fetches)
- Full table scans: Missing indexes on WHERE/JOIN columns
- Suboptimal joins: Wrong join order or type
CDC (Change Data Capture)
- Purpose: Track changes in database (inserts, updates, deletes)
- Methods:
- Log-based: Read database transaction logs (MySQL binlog, Postgres WAL)
- Trigger-based: Database triggers on changes
- Timestamp-based: Check last_modified column
- Use cases:
- Data replication to data warehouse
- Invalidating caches
- Event-driven architectures
- Tools: Debezium, Maxwell, AWS DMS
Full-Text Search
- Problem: SQL LIKE '%keyword%' is slow (the leading wildcard prevents use of ordinary B-tree indexes, forcing a scan)
- Solution: Specialized search engines with inverted indexes
- Features:
- Tokenization (breaking text into terms)
- Stemming (run, running → run)
- Relevance scoring (TF-IDF, BM25)
- Fuzzy matching (typo tolerance)
- Examples: Elasticsearch, Solr, Algolia
- Architecture: Separate search cluster, sync from primary DB via CDC
Caching Strategies
Cache Levels
- Client-side: Browser cache, mobile app cache
- CDN: Edge servers cache static assets
- Reverse proxy: Nginx caches responses
- Application cache: In-memory (Redis, Memcached)
- Database cache: Query result cache, buffer pool
Cache Patterns
Cache-Aside (Lazy Loading)
1. Check cache
2. If miss: fetch from DB, populate cache
3. Return data
- Pros: Only caches requested data
- Cons: Cache miss penalty, potential stale data
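The three cache-aside steps above can be sketched as follows. `CacheAside` and `load_from_db` are hypothetical names, and a plain dict stands in for both the cache and the database.

```python
class CacheAside:
    """Application-managed cache: check cache, fall back to the DB, populate cache."""

    def __init__(self, load_from_db):
        self._cache = {}
        self._load_from_db = load_from_db

    def get(self, key):
        if key in self._cache:               # 1. check cache
            return self._cache[key]
        value = self._load_from_db(key)      # 2. miss: fetch from DB
        self._cache[key] = value             #    and populate the cache
        return value                         # 3. return data

    def invalidate(self, key):
        """Drop a key after the underlying row changes, to limit stale reads."""
        self._cache.pop(key, None)

# Stand-in database that records how often it is read.
db = {"user:1": "Ada"}
db_reads = []

def load_from_db(key):
    db_reads.append(key)
    return db[key]

cache = CacheAside(load_from_db)
first = cache.get("user:1")    # miss: reads the DB
second = cache.get("user:1")   # hit: served from the cache
```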
Read-Through
- Cache sits between app and DB
- Cache handles DB fetching automatically
- Pros: Simpler app code
- Cons: Cache miss still slow
Write-Through
- Writes go to cache and DB synchronously
- Pros: Cache always consistent
- Cons: Higher write latency
Write-Behind (Write-Back)
- Writes go to cache, asynchronously written to DB
- Pros: Low write latency
- Cons: Risk of data loss, complex
Write-Around
- Writes go directly to DB, bypass cache
- Pros: Avoids cache pollution from writes
- Cons: Cache miss on next read
Cache Eviction Policies
- LRU (Least Recently Used): Evict the item that was accessed least recently (good general purpose)
- LFU (Least Frequently Used): Evict the item with the fewest accesses (good for stable access patterns)
- FIFO (First In First Out): Evict oldest item (simple but not optimal)
- TTL (Time To Live): Evict after fixed time (good for time-sensitive data)
- Random: Evict random item (simple, surprisingly effective)
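As an illustration of LRU eviction, here is a minimal sketch built on `collections.OrderedDict`. The class name `LRUCache` is hypothetical, and the sketch is not thread-safe.

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()   # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now more recent than "b"
cache.put("c", 3)   # over capacity: "b" is evicted
```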
Cache Stampede (Thundering Herd)
- Problem: Cache expires, multiple requests hit DB simultaneously
- Solutions:
- Lock on cache miss (first request fetches, others wait)
- Probabilistic early expiration
- Background refresh before expiration
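The "lock on cache miss" solution can be sketched with a per-key lock, so that only the first request runs the expensive load while the others wait and then reuse the result. `SingleFlightCache` and `slow_loader` are hypothetical names; the sketch is in-process only and omits expiry.

```python
import threading
import time

class SingleFlightCache:
    """On a miss, one caller loads the value; concurrent callers wait for it."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()   # protects the lock table itself

    def get(self, key):
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            if key in self._values:      # another thread filled it while we waited
                return self._values[key]
            value = self._loader(key)
            self._values[key] = value
            return value

calls = []

def slow_loader(key):
    calls.append(key)        # record each trip to the "database"
    time.sleep(0.05)         # simulate a slow query
    return key.upper()

cache = SingleFlightCache(slow_loader)
threads = [threading.Thread(target=cache.get, args=("k",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))
```

Across multiple servers, the same idea needs a distributed lock or a request-coalescing layer instead of a local `threading.Lock`.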
3. Scalability & Reliability
Load Balancing Algorithms
Round Robin
- Distribute requests sequentially across servers
- Pros: Simple, fair distribution
- Cons: Doesn't consider server load/capacity
- Weighted Round Robin: Assign more requests to powerful servers
Least Connections
- Send to server with fewest active connections
- Pros: Better for long-lived connections
- Cons: Requires tracking connection state
Consistent Hashing
- Hash both requests and servers onto ring
- Request goes to next clockwise server
- Pros: Minimal redistribution when adding/removing servers
- Cons: Can create hotspots
- Solution: Virtual nodes (multiple positions per server)
- Use case: Distributed caches, sharding
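A minimal sketch of a hash ring with virtual nodes follows. The class name `HashRing`, the server names, the MD5-based hash, and the choice of 100 virtual nodes per server are all assumptions for illustration.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, servers=(), vnodes=100):
        self._vnodes = vnodes
        self._positions = []   # sorted ring positions
        self._owners = {}      # position -> server
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server gets several positions (virtual nodes) to even out load.
        for i in range(self._vnodes):
            pos = _hash(f"{server}#{i}")
            bisect.insort(self._positions, pos)
            self._owners[pos] = server

    def remove(self, server):
        self._positions = [p for p in self._positions if self._owners[p] != server]
        self._owners = {p: s for p, s in self._owners.items() if s != server}

    def lookup(self, key):
        if not self._positions:
            raise LookupError("ring is empty")
        # Next position clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._positions)
        return self._owners[self._positions[idx]]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(10_000)]
before = {key: ring.lookup(key) for key in keys}
ring.add("cache-d")
after = {key: ring.lookup(key) for key in keys}
moved_keys = [key for key in keys if before[key] != after[key]]
moved_fraction = len(moved_keys) / len(keys)
print(f"{moved_fraction:.0%} of keys moved after adding a fourth server")
```

Only keys claimed by the new server move, so the moved share should be near one quarter here rather than the large majority that modulo-based placement would remap.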
IP Hash
- Hash client IP to determine server
- Pros: Same client always goes to same server (session affinity)
- Cons: Uneven distribution if IPs not diverse
Least Response Time
- Send to server with fastest response
- Pros: Adapts to server performance
- Cons: Requires health checks, more complex
Rate Limiting & Throttling
Why Rate Limit?
- Prevent abuse/DoS attacks
- Fair resource allocation
- Cost control (API quotas)
- Ensure quality of service
Algorithms
Token Bucket
- Bucket holds tokens (refilled at fixed rate)
- Request consumes token
- Pros: Handles bursts, smooth rate
- Cons: More complex
- Example: AWS API Gateway
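A minimal single-process sketch of a token bucket is shown below. The class name `TokenBucket`, the lazy refill on each call, and the injectable `clock` parameter (used so the demo does not need to sleep) are assumptions, not a description of any particular product.

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, then a steady `rate` of requests per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill lazily based on elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False

# Demo with a fake clock: 1 token/second, burst of 3.
now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=3, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(4)]         # fourth request is rejected
now[0] += 2.0                                      # two seconds pass: two tokens refill
after_refill = [bucket.allow() for _ in range(3)]  # third request is rejected
```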
Leaky Bucket
- Requests enter bucket, leak out at fixed rate
- Pros: Smooth output rate
- Cons: No burst handling
- Use case: Network traffic shaping
Fixed Window
- Allow N requests per time window (e.g., 100/hour)
- Pros: Simple
- Cons: Bursts at window boundaries (with a limit of 100, up to 200 requests can arrive within a second if they straddle two adjacent windows)
Sliding Window Log
- Track timestamp of each request
- Count requests in sliding time window
- Pros: Accurate, no boundary burst
- Cons: Memory intensive (store all timestamps)
Sliding Window Counter
- Combines fixed window + weighted previous window
- Pros: Accurate, memory efficient
- Cons: Slightly complex
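The sliding window counter can be sketched as below: the previous window's count is weighted by how much of it still overlaps the sliding window, then added to the current count. The class name, the injectable `clock`, and the single-process state are assumptions; a distributed version would keep the two counters in a shared store.

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._current_window = None   # index of the fixed window we are counting in
        self._current_count = 0
        self._previous_count = 0

    def allow(self) -> bool:
        now = self._clock()
        window = int(now // self.window)
        if window != self._current_window:
            # Roll over; the old count only matters if it is the adjacent window.
            adjacent = (self._current_window is not None
                        and window == self._current_window + 1)
            self._previous_count = self._current_count if adjacent else 0
            self._current_window = window
            self._current_count = 0
        elapsed_fraction = (now % self.window) / self.window
        estimated = self._previous_count * (1 - elapsed_fraction) + self._current_count
        if estimated < self.limit:
            self._current_count += 1
            return True
        return False

# Demo with a fake clock: 10 requests per 60-second window.
now = [0.0]
limiter = SlidingWindowCounter(limit=10, window_seconds=60, clock=lambda: now[0])
first_window = [limiter.allow() for _ in range(11)]  # 10 allowed, 11th rejected
now[0] = 60.0
at_boundary = limiter.allow()   # previous window still counts fully: rejected
now[0] = 90.0
mid_window = [limiter.allow() for _ in range(6)]     # half the old count has aged out
```

Unlike the fixed window, the request right at the boundary is rejected, because the previous window's traffic is still weighted in.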
Implementation
- Storage: Redis (INCR with EXPIRE)
- Response: 429 Too Many Requests + Retry-After header
- Distributed: Use centralized Redis, not in-memory (consistent across servers)
Message Queues & Streams
Use Cases
- Decoupling: Producers/consumers don't need to know about each other
- Async processing: Handle time-consuming tasks
- Load leveling: Queue absorbs traffic spikes
- Reliability: Messages persist until processed
Kafka
- Model: Distributed log (append-only)
- Key features:
- High throughput (millions msgs/sec)
- Partitions for parallelism
- Persistent storage
- Consumer groups
- Replay capability (seek to offset)
- Use cases: Event streaming, log aggregation, real-time analytics
RabbitMQ
- Model: Traditional message broker
- Key features:
- Multiple exchange types (direct, topic, fanout)
- Acknowledgments
- Priority queues
- Dead letter queues
- Use cases: Task queues, RPC, routing
AWS SQS
- Model: Managed queue service
- Types:
- Standard: At-least-once delivery, best-effort ordering
- FIFO: Exactly-once processing (duplicates removed within a 5-minute deduplication window), strict ordering within a message group
- Features: Auto-scaling, dead letter queues, visibility timeout
- Use cases: Decoupling microservices, job queues
Backpressure
- Problem: Slow consumers can't keep up with producers
- Solutions:
- Push-back to producers (reject requests)
- Dynamic batching
- Increase consumer parallelism
- Drop messages (if acceptable)
Consumer Groups
- Concept: Multiple consumers in group share message processing
- Kafka: Each partition assigned to one consumer in group (maximum parallelism = partition count; extra consumers sit idle)
- Benefits: Horizontal scaling, fault tolerance
Leader Election
Why?
- Ensure single coordinator in distributed system
- Prevent split-brain scenarios
- Coordinate distributed operations
Algorithms
Raft
- Leader elected via voting
- Heartbeats maintain leadership
- Log replication for state machine
- Pros: Understandable, proven
- Used in: etcd, Consul
Paxos
- Consensus via proposers, acceptors, learners
- Pros: Theoretically sound
- Cons: Complex to implement
- Used in: Google Chubby
ZooKeeper (ZAB protocol)
- Centralized coordination service
- Sequential consistency
- Use cases: Configuration management, leader election, distributed locks
- Drawback: Critical shared dependency; run it as a multi-node ensemble so a quorum survives individual node failures
Failover & Redundancy
Active-Passive (Master-Standby)
- Setup: One active server, one standby
- Failover: Standby takes over if active fails
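- Pros: Simpler, no write conflicts (split-brain is still possible if the standby is promoted while the active is alive, so use fencing)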
- Cons: Wasted resources (standby idle)
- Use case: Databases, critical services
Active-Active (Multi-Master)
- Setup: All servers handle traffic
- Pros: Better resource utilization, no failover delay
- Cons: Complex conflict resolution, data sync
- Use case: Stateless services, CDNs
Health Checks
- Types:
- Passive: Monitor logs/metrics
- Active: Periodic pings/HTTP checks
- Considerations: Check interval vs false positives, cascading failures
4. System Design Patterns
Read vs Write-Heavy Systems
Read-Heavy Optimization
- Caching: Aggressive caching (Redis, CDN)
- Read replicas: Multiple database replicas
- Denormalization: Duplicate data to avoid joins
- Indexing: Optimize for common queries
- Examples: Social media feeds, news sites, e-commerce browsing
Write-Heavy Optimization
- Write buffering: Queue writes, batch inserts
- Asynchronous processing: Background workers
- Eventual consistency: Accept temporary inconsistency
- Sharding: Distribute writes across nodes
- Optimize indexes: Fewer indexes (faster writes, slower reads)
- Examples: IoT data ingestion, logging systems, analytics
CQRS (Command Query Responsibility Segregation)
Concept
- Separate models for reads (queries) and writes (commands)
- Different databases optimized for each
Write Side (Command)
- Handles business logic
- Validates and processes commands
- Emits events
Read Side (Query)
- Optimized for queries (denormalized views)
- Updated via events from write side
- Eventually consistent
Benefits
- Independent scaling of reads/writes
- Optimized data models for each
- Clear separation of concerns
Drawbacks
- Increased complexity
- Eventual consistency challenges
- Need to sync read models
Use Cases
- Complex domains (e-commerce, banking)
- Different read/write patterns
- Multiple read models (different views of data)
Event Sourcing
Concept
- Store events (state changes) instead of current state
- Rebuild state by replaying events
Example
Events:
1. AccountCreated(id:123, balance:0)
2. MoneyDeposited(id:123, amount:100)
3. MoneyWithdrawn(id:123, amount:30)
Current state: balance = 70 (derived from events)
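The example above can be replayed in code. This is a minimal sketch: the event classes mirror the three events listed, while `apply` and `replay` are hypothetical helper names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccountCreated:
    id: int
    balance: int

@dataclass(frozen=True)
class MoneyDeposited:
    id: int
    amount: int

@dataclass(frozen=True)
class MoneyWithdrawn:
    id: int
    amount: int

def apply(balance: int, event) -> int:
    """Fold a single event into the current balance."""
    if isinstance(event, AccountCreated):
        return event.balance
    if isinstance(event, MoneyDeposited):
        return balance + event.amount
    if isinstance(event, MoneyWithdrawn):
        return balance - event.amount
    raise TypeError(f"unknown event type: {type(event).__name__}")

def replay(events) -> int:
    """Rebuild state from scratch by applying events in order."""
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance

events = [
    AccountCreated(id=123, balance=0),
    MoneyDeposited(id=123, amount=100),
    MoneyWithdrawn(id=123, amount=30),
]
current_balance = replay(events)        # state derived from the full log
balance_after_two = replay(events[:2])  # "time travel": state after the deposit
```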
Benefits
- Complete audit trail
- Time travel (replay to any point)
- Debugging (see what happened)
- Multiple projections from same events
Drawbacks
- Complex queries (need to replay events)
- Storage growth
- Schema evolution challenges
Combined with CQRS
- Events from write side → update read models
- Natural fit: the event log serves as the write model, and each read model is a projection built from it
Caching Patterns (Revisited)
Write-Through Caching
- Flow: Write to cache → sync write to DB
- Pros: Cache always up-to-date
- Cons: Higher write latency (dual writes)
- Use case: Read-heavy with some writes
Write-Back (Write-Behind) Caching
- Flow: Write to cache → async write to DB (batched)
- Pros: Fast writes, batch optimization
- Cons: Data loss risk, complexity
- Use case: High write throughput (logs, analytics)
Write-Around Caching
- Flow: Write directly to DB, invalidate/bypass cache
- Pros: Doesn't pollute cache with write data
- Cons: Next read will be cache miss
- Use case: Write-once, read-rarely data
Idempotency & Retries
Idempotency
- Definition: Applying the same operation multiple times has the same effect as applying it once
- Examples:
- Idempotent: DELETE /user/123 (same result each time)
- Not idempotent: POST /user (creates new user each time)
Why Important?
- Network failures require retries
- Distributed systems need duplicate handling
- Prevent double-charging, duplicate records
Implementation
- Idempotency keys: Client generates unique ID per request
POST /payment
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
- Server stores: key → result mapping
- On retry: Return cached result if key exists
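A minimal sketch of the server side of this scheme follows. `PaymentService` and its in-memory dict are hypothetical; a real service would persist the key-to-result mapping and handle two concurrent requests carrying the same key.

```python
import uuid

class PaymentService:
    """Stores the result of each request under its idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored result
        self.charges = []    # side effects that must not be repeated

    def pay(self, idempotency_key: str, amount: int) -> dict:
        if idempotency_key in self._results:
            # Retry: return the stored result without charging again.
            return self._results[idempotency_key]
        self.charges.append(amount)
        result = {"status": "charged", "amount": amount, "charge_id": len(self.charges)}
        self._results[idempotency_key] = result
        return result

service = PaymentService()
key = str(uuid.uuid4())         # client generates one key per logical request
first = service.pay(key, 100)
retry = service.pay(key, 100)   # e.g. resent after a network timeout
```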
Retry Strategies
- Exponential backoff: Wait 1s, 2s, 4s, 8s... (with jitter)
- Circuit breaker: Stop retrying after threshold (prevent cascading failures)
- Deadline propagation: Don't retry if deadline exceeded
Consistency Models
Strong Consistency (Linearizability)
- Reads always return latest write
- Pros: Simple reasoning, no surprises
- Cons: Higher latency, lower availability
- Examples: Traditional RDBMS, ZooKeeper
- Use case: Financial transactions, inventory
Eventual Consistency
- Reads may return stale data temporarily
- Eventually all replicas converge
- Pros: High availability, low latency
- Cons: Complex application logic
- Examples: DynamoDB, Cassandra, DNS
- Use case: Social media, analytics
Causal Consistency
- Causally-related operations seen in order
- Concurrent operations can be seen in any order
- Example:
- Post message → Like message (causal, must be ordered)
- Two likes from different users (concurrent, any order OK)
- Use case: Collaborative applications
Read-Your-Writes Consistency
- User always sees their own updates
- Others may have delay
- Implementation: Route user's reads to same replica
Monotonic Reads
- If user sees value, subsequent reads won't see older value
- Prevents: Reading from lagging replica after reading from up-to-date one
5. Advanced Caching
CDN Deep Dive
- Push CDN: Origin server pushes content to edge servers proactively
- Pros: Lower latency, predictable
- Cons: Wasted bandwidth for unpopular content
- Use case: Popular content known in advance
- Pull CDN: Edge servers pull content on-demand (cache-aside)
- Pros: Only cache requested content
- Cons: First request slow (cold cache)
- Use case: Long-tail content distribution
Edge Computing
- Run compute at edge locations (not just caching)
- Use cases:
- A/B testing at edge
- Authentication/authorization
- Request manipulation
- Serverless functions
- Examples: CloudFlare Workers, Lambda@Edge
Redis Advanced Patterns
Redis as Message Broker
- Pub/Sub for real-time messaging
- Streams for event sourcing
- Pros: Fast, simple
- Cons: No message persistence (pub/sub), less feature-rich than Kafka
Redis as Database
- Persistence options: RDB snapshots, AOF logs
- Use cases: Session store, leaderboards, rate limiting
Redis Data Structures
- Strings, Lists, Sets, Sorted Sets, Hashes
- HyperLogLog (cardinality estimation)
- Bitmaps (user activity tracking)
- Geospatial indexes
Application-Level Caching
In-Memory Caching
- Libraries: Caffeine (Java), Go-cache, lru-cache (Node.js)
- Pros: Ultra-fast (no network)
- Cons: Not shared across servers, memory limited
Distributed Caching
- Examples: Redis, Memcached
- Pros: Shared state, larger capacity
- Cons: Network latency, failure modes
Multi-Level Caching
Request → L1 (in-memory) → L2 (Redis) → L3 (Database)
- Benefits: Balance speed and size
- Invalidation: Coordinate across levels
Cache Invalidation Strategies
Time-based (TTL)
- Simplest, works well for slowly-changing data
- Risk: Serving stale data until expiration
Event-based
- Invalidate on data changes
- Methods: Pub/Sub, CDC, explicit invalidation
- Pros: Always fresh data
- Cons: Complexity, potential race conditions
Write-through/Write-behind
- Update cache on writes
- Pros: Cache always current
- Cons: Write overhead
6. Observability
Backoff, Jitter, and Retry Strategies
Exponential Backoff
delay = base_delay * (2 ^ attempt)
Example: 1s, 2s, 4s, 8s, 16s...
- Problem: Thundering herd (all clients retry simultaneously after backoff)
Jitter
delay = base_delay * (2 ^ attempt) * random(0.5, 1.5)
- Benefit: Spreads out retries, prevents synchronized thundering herd
- Types:
- Full jitter: Random between 0 and max delay
- Equal jitter: Half fixed, half random
- Decorrelated jitter: Next delay is random between the base delay and 3x the previous delay (capped at a maximum)
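The backoff formula and the jitter variants can be sketched as follows. The function names and the `max_delay` cap are assumptions added for illustration; the decorrelated variant here draws the next delay from the range between the base delay and three times the previous delay.

```python
import random

def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Plain exponential backoff: base_delay * 2^attempt, capped at max_delay."""
    return min(max_delay, base_delay * 2 ** attempt)

def full_jitter(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Random delay between 0 and the exponential delay."""
    return random.uniform(0, backoff_delay(attempt, base_delay, max_delay))

def equal_jitter(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0) -> float:
    """Half of the exponential delay is fixed, the other half is random."""
    delay = backoff_delay(attempt, base_delay, max_delay)
    return delay / 2 + random.uniform(0, delay / 2)

def decorrelated_jitter(previous_delay: float, base_delay: float = 1.0,
                        max_delay: float = 60.0) -> float:
    """Next delay depends on the previous delay rather than the attempt number."""
    return min(max_delay, random.uniform(base_delay, previous_delay * 3))

schedule = [backoff_delay(attempt) for attempt in range(5)]
print(schedule)
```

In a retry loop, the caller would sleep for the chosen delay between attempts and stop after a maximum number of attempts or once a deadline has passed.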
Circuit Breaker Pattern
States:
- Closed: Normal operation, requests pass through
- Open: Errors exceed threshold, block requests immediately (fail fast)
- Half-Open: After timeout, allow test request
- Success → Close circuit
- Failure → Reopen circuit
Benefits:
- Prevent cascading failures
- Give downstream service time to recover
- Fast failure instead of waiting for timeout
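The three states can be sketched as a small state machine. The names `CircuitBreaker` and `CircuitOpenError`, the consecutive-failure threshold, and the injectable `clock` are assumptions; the sketch is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected immediately."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half_open"      # timeout elapsed: allow one test request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self.state == "half_open" or self._failures >= self.failure_threshold:
                self.state = "open"       # open (or reopen) the circuit
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "closed"             # success closes the circuit
        return result

def flaky():
    raise ConnectionError("downstream unavailable")

# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(lambda: "ok")
    failed_fast = False
except CircuitOpenError:
    failed_fast = True                    # rejected without calling downstream

now[0] += 10.0                            # reset timeout elapses
recovered = breaker.call(lambda: "ok")    # half-open test request succeeds
```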
Logging Best Practices
Structured Logging
{
"timestamp": "2025-10-02T10:30:00Z",
"level": "ERROR",
"service": "payment-service",
"trace_id": "abc-123",
"user_id": "456",
"message": "Payment processing failed",
"error": "insufficient funds"
}
- Benefits: Machine-parseable, searchable, aggregatable
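One way to emit entries shaped like the example above is a custom formatter for Python's standard `logging` module. `JsonFormatter` is a hypothetical class, and passing `service` and `trace_id` through `extra` is just one possible convention.

```python
import io
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)

# Write to an in-memory stream so the output can be inspected below.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error(
    "Payment processing failed",
    extra={"service": "payment-service", "trace_id": "abc-123"},
)
parsed = json.loads(stream.getvalue())
print(parsed["level"], parsed["trace_id"])
```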
Log Levels
- ERROR: Failures requiring immediate attention
- WARN: Unexpected but handled (retry success, degraded mode)
- INFO: Important business events (user signup, payment)
- DEBUG: Detailed diagnostic info (development only)
Correlation IDs
- Unique ID per request, propagated across services
- Benefit: Trace request flow through distributed system
- Implementation: X-Request-ID header
What NOT to Log
- Passwords, API keys, PII (privacy/security)
- Excessive debugging in production (cost/noise)
- Every database query (performance)
Monitoring & Metrics
Key Metrics (RED Method)
- Rate: Requests per second
- Errors: Error rate/count
- Duration: Latency (p50, p95, p99)
USE Method (for Resources)
- Utilization: % time resource busy (CPU, memory)
- Saturation: Queue length (requests waiting)
- Errors: Error count
Golden Signals (Google SRE)
- Latency: Time to serve request
- Traffic: Request volume
- Errors: Failed request rate
- Saturation: System fullness (how close to capacity)
Metric Types
- Counter: Monotonically increasing (requests_total)
- Gauge: Current value (cpu_usage, active_connections)
- Histogram: Distribution of values (request_duration_seconds)
- Summary: Like a histogram, but quantiles are calculated on the client side (so they cannot be aggregated across instances)
Prometheus & Grafana
- Prometheus: Time-series database, pull-based scraping
- PromQL query language
- Alert manager integration
- Grafana: Visualization dashboards
- Multiple data sources
- Alerting, annotations
Alerting Best Practices
Alert Fatigue Prevention
- Alert on symptoms, not causes (user-facing issues, not low-level)
- Make alerts actionable (clear remediation steps)
- Avoid duplicate alerts
- Use severity levels (critical vs warning)
On-Call Considerations
- Runbooks: Step-by-step troubleshooting guides
- Escalation policies: Who to notify, when
- Postmortems: Blameless analysis after incidents
Distributed Tracing
Concept
- Track request flow across microservices
- Visualize latency bottlenecks
- Identify failing service in chain
Implementation
- Trace: Single request journey
- Span: Individual operation (DB query, HTTP call)
- Trace ID: Unique identifier for entire trace
- Span ID: Unique identifier for operation
Tools
- Jaeger: Open-source, CNCF project
- Zipkin: Twitter-originated
- OpenTelemetry: Vendor-neutral standard (merges OpenTracing + OpenCensus)
Example Trace
Frontend (50ms)
├─ Auth Service (10ms)
├─ Product Service (30ms)
│ └─ Database Query (25ms) ← Bottleneck!
└─ Payment Service (5ms)
SLO, SLI, SLA
SLI (Service Level Indicator)
- Definition: Metric representing system health
- Examples:
- Request success rate
- Request latency (e.g., p95; a threshold such as p95 < 200ms turns it into an SLO)
- System uptime
SLO (Service Level Objective)
- Definition: Target value/range for SLI
- Example: "99.9% of requests succeed"
- Purpose: Internal goal for reliability
SLA (Service Level Agreement)
- Definition: Contract with users (with consequences if the agreed service level is missed)
- Example: "99.9% uptime or customer gets refund"
- Relationship: SLA ≤ SLO (the internal SLO should be stricter than the contractual SLA, e.g., SLO 99.95% vs SLA 99.9%, to leave a buffer)
Error Budgets
- Concept: Acceptable downtime based on SLO
- Example: 99.9% SLO = 43 minutes downtime/month allowed
- Usage: If budget exhausted, freeze features and focus on reliability
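The error-budget arithmetic can be checked with a few lines. This is a sketch that assumes a 30-day window; the function name is hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# 30 days = 43,200 minutes, so each extra "nine" cuts the budget tenfold.
budget_two_nines = error_budget_minutes(0.99)     # about 432 minutes
budget_three_nines = error_budget_minutes(0.999)  # about 43.2 minutes
budget_four_nines = error_budget_minutes(0.9999)  # about 4.32 minutes
```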
7. Security & Privacy
Authentication vs Authorization
Authentication
- Definition: Verifying who the user is
- Methods:
- Username/password
- Multi-factor authentication (MFA)
- Biometrics
- Certificate-based
Authorization
- Definition: What the user can do
- Models:
- RBAC (Role-Based): User has roles (admin, editor), roles have permissions
- ABAC (Attribute-Based): Policies based on attributes (department, clearance level)
- ACL (Access Control List): Resource lists who can access
Authentication Mechanisms
JWT (JSON Web Tokens)
Structure: Header.Payload.Signature (three Base64URL-encoded parts). Example decoded payload:
{
"sub": "user123",
"name": "John Doe",
"exp": 1730000000
}
- Pros: Stateless, self-contained, scalable
- Cons: Hard to revoke before expiry (needs a denylist or short-lived tokens), token size
- Use case: API authentication, microservices
OAuth 2.0
- Purpose: Delegated authorization (allow app access without sharing password)
- Flow Example (Authorization Code):
- User clicks "Login with Google"
- Redirect to Google (authorization server)
- User approves
- Google redirects back with authorization code
- App exchanges code for access token
- App uses token to access Google APIs
- Roles:
- Resource Owner (user)
- Client (your app)
- Authorization Server (Google)
- Resource Server (Google APIs)
SSO (Single Sign-On)
- Definition: One login for multiple applications
- Protocols: SAML, OpenID Connect (authentication layer built on OAuth 2.0)
- Benefits: Better UX, centralized access control
- Use case: Enterprise applications
Session-Based Authentication
- Flow:
- User logs in
- Server creates session, stores in DB/Redis
- Returns session ID cookie
- Client sends cookie with requests
- Pros: Can revoke immediately
- Cons: Stateful, harder to scale (requires sticky sessions or shared session store)
Encryption
TLS/HTTPS
- Purpose: Encrypt data in transit
- Handshake:
- Client Hello (supported cipher suites)
- Server Hello (chosen cipher, certificate)
- Client verifies certificate
- Key exchange (establish shared secret)
- Encrypted communication
- TLS 1.3: Faster handshake (1 round trip instead of 2), drops legacy cipher suites and static RSA key exchange
Data at Rest
- Encryption: AES-256 (symmetric encryption)
- Key management: HSM (Hardware Security Module), KMS (Key Management Service)
- Database encryption:
- Full disk encryption
- Column-level encryption (for sensitive fields)
- Application-level encryption: Encrypt before storing
Hashing vs Encryption
- Hashing: One-way (passwords)
- Use bcrypt, Argon2 (not MD5, SHA1)
- Salt to prevent rainbow tables
- Encryption: Two-way (reversible with key)
- Use AES, RSA
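A sketch of salted, slow password hashing. bcrypt and Argon2 need third-party packages, so this example swaps in PBKDF2-HMAC-SHA256 from the standard library to stay self-contained; the iteration count is an assumed value that should be tuned for real hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor for illustration; tune for your hardware

def hash_password(password: str, salt: bytes = None):
    """Return (salt, digest). A fresh random salt defeats precomputed rainbow tables."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected_digest: bytes) -> bool:
    _, digest = hash_password(password, salt)
    # Constant-time comparison of the stored and recomputed digests.
    return hmac.compare_digest(digest, expected_digest)

salt, stored = hash_password("correct horse battery staple")
ok = verify_password("correct horse battery staple", salt, stored)
rejected = verify_password("wrong guess", salt, stored)
```

The hash is one-way: the server stores only `salt` and `stored`, and checks a login by recomputing the digest, never by decrypting anything.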
DDoS Protection & WAF
DDoS (Distributed Denial of Service)
Types:
- Volumetric: Flood with traffic (UDP flood, amplification attacks)
- Protocol: Exploit protocol weaknesses (SYN flood)
- Application Layer: Target application (HTTP flood)
Mitigation:
- Rate limiting: Per IP, per endpoint
- CDN: Absorb traffic at edge
- Anycast: Distribute traffic across locations
- Traffic analysis: Identify and block malicious patterns
- Overprovisioning: Have excess capacity
WAF (Web Application Firewall)
- Purpose: Filter malicious HTTP traffic
- Protection against:
- SQL injection
- XSS (Cross-Site Scripting)
- CSRF (Cross-Site Request Forgery)
- Path traversal
- Types:
- Network-based (hardware appliance)
- Host-based (integrated in app)
- Cloud-based (CloudFlare, AWS WAF)
- Rules: Signature-based, behavioral analysis
Data Privacy (GDPR Basics)
Key Principles
- Lawful basis: Processing needs one of six legal bases (e.g., consent, contract, legal obligation, legitimate interest)
- Data minimization: Collect only necessary data
- Purpose limitation: Use data only for stated purpose
- Storage limitation: Don't keep data longer than needed
- Accuracy: Keep data up-to-date
- Security: Protect with encryption, access controls
User Rights
- Right to access: User can request their data
- Right to erasure: "Right to be forgotten" (delete data)
- Right to portability: Export data in machine-readable format
- Right to rectification: Correct inaccurate data
Implementation Considerations
- Data inventory: Know what PII you collect
- Consent management: Track and honor user consent
- Data retention policies: Auto-delete old data
- Breach notification: Report breaches within 72 hours
- Privacy by design: Build privacy into systems from start
8. Infrastructure & Deployment
Containers & Orchestration
Docker
Key Concepts:
- Image: Read-only template (base OS + app + dependencies)
- Container: Running instance of image
- Dockerfile: Instructions to build image
- Layers: Each instruction creates layer (cached for efficiency)
Benefits:
- Consistent environments (dev = prod)
- Lightweight (vs VMs)
- Fast startup
- Isolated processes
Kubernetes (K8s)
Architecture:
- Control Plane: Master node(s)
- API Server
- Scheduler (assigns pods to nodes)
- Controller Manager (maintains desired state)
- etcd (distributed config store)
- Worker Nodes: Run pods
- kubelet (agent)
- kube-proxy (networking)
- Container runtime (containerd, CRI-O; Docker Engine needs the cri-dockerd adapter since Kubernetes 1.24)
Key Resources:
- Pod: Smallest unit (1+ containers)
- Deployment: Manages replica sets, rolling updates
- Service: Stable endpoint for pods (load balancing)
- ConfigMap: Configuration data
- Secret: Sensitive data (only base64-encoded by default; enable encryption at rest and restrict access with RBAC)
- Ingress: HTTP(S) routing to services
- Namespace: Virtual clusters for isolation
Benefits:
- Auto-scaling (HPA - Horizontal Pod Autoscaler)
- Self-healing (restart failed pods)
- Rolling updates, rollbacks
- Service discovery
- Storage orchestration
CI/CD Pipelines
Continuous Integration (CI)
- Goal: Frequently merge code to main branch
- Pipeline:
- Code commit triggers build
- Run tests (unit, integration)
- Static analysis (linting, security scans)
- Build artifacts (Docker images)
- Benefits: Catch bugs early, reduce merge conflicts
Continuous Deployment (CD)
- Goal: Automatically deploy to production
- Pipeline:
- Successful CI build
- Deploy to staging
- Run E2E tests
- Deploy to production (if tests pass)
Deployment Strategies
Blue-Green Deployment
- Setup: Two identical environments (Blue = current, Green = new)
- Process:
- Deploy new version to Green
- Test Green
- Switch traffic from Blue to Green
- Keep Blue for quick rollback
- Pros: Zero downtime, instant rollback
- Cons: Double resources
Canary Deployment
- Process:
- Deploy new version to small % of servers (5%)
- Monitor errors, performance
- Gradually increase % (10%, 25%, 50%, 100%)
- Rollback if issues
- Pros: Lower risk, real-world testing
- Cons: Slower rollout, complex routing
Rolling Deployment
- Process: Update servers one-by-one (or in batches)
- Pros: No extra resources
- Cons: Mixed versions during rollout, slower rollback
Feature Flags
- Deploy code with features disabled
- Enable features gradually (per user, %)
- Pros: Decouple deployment from release, A/B testing
- Cons: Code complexity, technical debt
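Gradual enablement by percentage is commonly implemented by hashing the user into a stable bucket. The sketch below is one way to do it; the flag name, user IDs, and helper names are hypothetical.

```python
import hashlib

def rollout_bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def is_enabled(flag: str, user_id: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` buckets."""
    return rollout_bucket(flag, user_id) < percent

# A user's bucket never changes, so raising the percentage only adds users:
# anyone enabled at 5% is still enabled at 25%.
users = [f"user-{i}" for i in range(10_000)]
enabled_at_25 = sum(is_enabled("new-checkout", u, 25) for u in users)
```

Including the flag name in the hash gives each flag an independent set of early users, so the same people are not always first to receive every new feature.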
Service Discovery
Problem
- Microservices need to find each other
- IPs/ports change dynamically (scaling, failures)
Client-Side Discovery
- Process: Client queries service registry, chooses instance, makes request
- Examples: Netflix Eureka
- Pros: Client controls load balancing
- Cons: Client complexity, tight coupling to registry
Server-Side Discovery
- Process: Client requests load balancer, load balancer queries registry
- Examples: AWS ELB, Kubernetes Service
- Pros: Client simplicity
- Cons: Load balancer is potential bottleneck/SPOF
Service Registry
- Examples: Consul, etcd, ZooKeeper
- Features:
- Health checks
- Automatic deregistration of failed services
- DNS interface
Kubernetes Service Discovery
- Built-in via DNS and environment variables
- ClusterIP service provides stable virtual IP
- DNS:
service-name.namespace.svc.cluster.local
API Gateway
Purpose
- Single entry point for clients
- Abstracts backend complexity
Responsibilities
- Routing: Direct requests to appropriate microservice
- Authentication/Authorization: Centralized security
- Rate limiting: Protect backend services
- Request/Response transformation: Adapt protocols/formats
- Caching: Reduce backend load
- Logging & Monitoring: Centralized observability
- SSL termination: Handle TLS at gateway
Patterns
- Backend for Frontend (BFF): Separate gateway per client type (web, mobile, IoT)
Examples
- Kong, AWS API Gateway, Apigee, Zuul
Microservices vs Monoliths
Monolith
Pros:
- Simple to develop, test, deploy (initially)
- No network latency between components
- Easier transactions (single DB)
- Simpler debugging
Cons:
- Scales as one unit (can't scale component independently)
- Tech stack lock-in
- Deployment risk (entire app redeployed)
- Codebase becomes unwieldy
Microservices
Pros:
- Independent scaling per service
- Technology diversity
- Fault isolation (one service failure doesn't crash all)
- Faster deployments (small, independent)
- Team autonomy
Cons:
- Distributed system complexity (network failures, latency)
- Data consistency challenges
- Increased operational overhead (monitoring, deployment)
- Testing complexity
When to Use
- Monolith: Startups, simple domains, small teams
- Microservices: Large orgs, complex domains, independent team scaling
Migration Strategy
- Start with monolith
- Extract services as domain understanding grows
- "Strangler Fig" pattern (gradually replace monolith pieces)
9. Special Topics
Search Systems
Inverted Index (Deep Dive)
Structure:
Term → [Doc1, Doc2, ...]
"hello" → [doc1, doc3, doc5]
"world" → [doc1, doc2]
With Positions (for phrase queries):
"hello" → {doc1: [0, 15], doc3: [5]}
Search Process:
- Tokenize query: "hello world" → ["hello", "world"]
- Lookup each term in index
- Intersect posting lists: doc1 (appears in both)
- Rank results by relevance
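The search process above (without the ranking step) can be sketched in a few lines. Tokenization here is a plain lowercase whitespace split, which is a simplifying assumption; the document texts are made up so that the posting lists match the structure shown earlier.

```python
from collections import defaultdict

def build_index(docs: dict) -> dict:
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

docs = {
    "doc1": "hello world",
    "doc2": "world news",
    "doc3": "hello there",
    "doc5": "say hello",
}
index = build_index(docs)
hits = search(index, "hello world")  # {"doc1"}: the only document with both terms
```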
Ranking Algorithms
TF-IDF (Term Frequency-Inverse Document Frequency)
- TF: How often term appears in document
- IDF: How rare term is across all documents
- Score: TF × IDF (terms that are frequent in a document but rare across the corpus score highest)
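A small worked example of the score. TF-IDF has many variants; this sketch assumes relative term frequency and an unsmoothed natural-log IDF, and the toy corpus is made up.

```python
import math
from collections import Counter

def tf_idf(term: str, doc_tokens: list, corpus: list) -> float:
    """TF-IDF of `term` in one document, given the whole tokenized corpus."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    idf = math.log(len(corpus) / doc_freq) if doc_freq else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "chased", "the", "dog"],
]

score_the = tf_idf("the", corpus[0], corpus)  # 0.0: appears in every document
score_cat = tf_idf("cat", corpus[0], corpus)  # positive: appears in 2 of 3 documents
score_sat = tf_idf("sat", corpus[0], corpus)  # highest: appears in only 1 document
```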
BM25 (Best Match 25)
- Improved TF-IDF with diminishing returns
- Considers document length normalization
- Industry standard
Elasticsearch Architecture
- Cluster: Multiple nodes
- Index: Collection of documents (like database)
- Shard: Subset of index data (for horizontal scaling)
- Replica: Copy of shard (for availability)
Query Types:
- Match query (full-text search)
- Term query (exact match)
- Range query (dates, numbers)
- Bool query (AND, OR, NOT)
- Fuzzy query (typo tolerance)
Bloom Filters
Problem
- Check if element exists in set
- Traditional: Hash table (space inefficient for large sets)
Bloom Filter
- Data structure: Bit array + k hash functions
- Add: Set bits at k hash positions to 1
- Check: If all k bits are 1, element might exist; if any bit is 0, it was definitely never added
- False positives: Possible (bits set by other elements)
- False negatives: Impossible (adding an element sets all k of its bits, and bits are never cleared)
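A minimal Bloom filter sketch. Deriving the k hash functions by prefixing an index to the item before SHA-256 is an assumption made for simplicity; real implementations typically use faster non-cryptographic hashes and size the bit array from the target false positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Simulate k independent hash functions by salting with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

seen_urls = BloomFilter()
for url in ["a.example", "b.example", "c.example"]:
    seen_urls.add(url)

definitely_present_or_collision = seen_urls.might_contain("a.example")  # True
```

A `True` answer still has to be confirmed against the real data store; a `False` answer lets the caller skip that expensive lookup entirely.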
Use Cases
- Database: Check if key exists before expensive disk lookup
- Web: Block malicious URLs (quick check before full validation)
- Distributed systems: Reduce unnecessary network calls
- Example: Google Chrome has used Bloom filters for malicious site detection (Safe Browsing)
Trade-off
- Space efficient (small bit array)
- Tunable false positive rate (more bits/hashes = fewer false positives)
Recommendation Systems
Collaborative Filtering
User-based:
- Find similar users (based on past behavior)
- Recommend items those users liked
- Example: Users who liked A and B also liked C
Item-based:
- Find similar items (based on user interactions)
- Recommend similar items to what user liked
- Example: People who liked this movie also liked...
Matrix Factorization (popularized by the Netflix Prize):
- Decompose user-item matrix into latent factors
- Predict missing ratings
Content-Based Filtering
- Recommend based on item attributes
- Example: User likes sci-fi movies → recommend other sci-fi
Hybrid Approaches
- Combine collaborative + content-based
- Cold start problem: Use content-based for new users/items
Ranking
- Factors: Relevance, popularity, diversity, freshness
- ML models: Gradient boosting, neural networks
- A/B testing: Compare ranking algorithms
Distributed Transactions
Problem
- Transaction spans multiple databases/services
- Need ACID guarantees across systems
Two-Phase Commit (2PC)
Phase 1 (Prepare):
- Coordinator asks all participants: "Can you commit?"
- Participants lock resources, respond yes/no
Phase 2 (Commit/Abort):
- If all said yes: Coordinator sends "commit" to all
- If any said no: Coordinator sends "abort" to all
Problems:
- Blocking: If coordinator crashes, participants locked
- Single point of failure: Coordinator
- Performance: Synchronous, slow
Saga Pattern
Concept: Break transaction into local transactions, compensate on failure
Example (booking trip):
- Book flight (local tx)
- Book hotel (local tx)
- Book car (local tx)
If step 3 fails → compensate: cancel hotel, cancel flight
Types:
- Choreography: Services communicate via events (decentralized)
- Orchestration: Central coordinator (like 2PC but async)
Pros: No blocking, better availability
Cons: Eventual consistency, complex compensation logic
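An orchestration-style saga for the trip-booking example can be sketched as a loop that remembers completed steps and undoes them in reverse order on failure. `run_saga` and the booking helpers are hypothetical, and the in-memory `log` stands in for real service calls.

```python
def run_saga(steps) -> bool:
    """Run (name, action, compensation) steps in order.

    If an action fails, run the compensations of all completed steps
    in reverse order and report failure.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                undo()
            return False
        completed.append((name, compensation))
    return True

log = []

def book(item):
    return lambda: log.append(f"book {item}")

def cancel(item):
    return lambda: log.append(f"cancel {item}")

def failing_car_booking():
    raise RuntimeError("no cars available")

succeeded = run_saga([
    ("flight", book("flight"), cancel("flight")),
    ("hotel", book("hotel"), cancel("hotel")),
    ("car", failing_car_booking, cancel("car")),
])
# log: book flight, book hotel, cancel hotel, cancel flight
```

Between the hotel booking and its cancellation, other readers can observe the intermediate state, which is the eventual-consistency cost listed above. Real compensations must also be retried until they succeed, which this sketch omits.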
When to Use
- 2PC: When strong consistency absolutely required (rare)
- Saga: Most distributed systems (accept eventual consistency)
- Avoid distributed transactions: Design to avoid need (bounded contexts)
Consensus Algorithms
Why Needed?
- Distributed systems need to agree on values
- Leader election, configuration, distributed locks
Raft (Understandable Consensus)
Roles:
- Leader: Handles all client requests
- Follower: Passive, replicate leader's log
- Candidate: Follower becomes candidate during election
Leader Election:
- Leader sends heartbeats
- If follower doesn't hear heartbeat (timeout) → becomes candidate
- Candidate requests votes from other nodes
- Majority votes → becomes leader
Log Replication:
- Leader receives command, appends to log
- Sends log entry to followers
- When majority replicate → entry committed
- Leader notifies followers, applies to state machine
Guarantees:
- Only one leader per term
- Logs eventually identical across servers
- Committed entries durable
Paxos
- More complex than Raft (harder to understand/implement)
- Three roles: Proposers, Acceptors, Learners
- Multi-Paxos optimized for multiple decisions
- Used in: Google Chubby, Spanner
Practical Usage
- Don't implement yourself: Use existing (etcd, Consul, ZooKeeper)
- Use for: Leader election, distributed config, locking
- Not for: Every coordination need (too heavyweight)
Time & Ordering in Distributed Systems
Problem
- No global clock in distributed systems
- Clock skew (servers have different times)
- Need to order events across servers
Lamport Clocks (Logical Time)
- Each process maintains counter
- Rules:
- Increment counter before each event
- Send counter with message
- Receiver sets counter = max(local, received) + 1
- Property: If event A happened-before B, then timestamp(A) < timestamp(B)
- Limitation: Converse not true (can't determine causality from timestamps alone)
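The three rules translate directly into code. This is a minimal sketch with a hypothetical `LamportClock` class; message passing is simulated by handing the timestamp from one clock to the other.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event: increment the counter."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, received: int) -> int:
        """Jump past both the local and the received timestamp."""
        self.time = max(self.time, received) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
p1.tick()                   # p1 local event, time 1
sent_at = p1.send()         # p1 sends, time 2
p2.tick()                   # p2 local event, time 1
received_at = p2.receive(sent_at)  # p2 receives: max(1, 2) + 1 = 3
```

The send happened-before the receive, and the timestamps agree (2 < 3). The reverse inference does not hold: p2's local event at time 1 and p1's event at time 1 are unrelated despite comparable timestamps.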
Vector Clocks
- Each process maintains vector of counters (one per process)
- Example: [P1:3, P2:5, P3:2]
- Rules:
- Increment own counter on event
- Send entire vector with message
- Receiver merges vectors (max of each component), then increments its own counter
- Property: Can determine causality
- A happened-before B: VA < VB (every component of VA ≤ VB, and at least one strictly smaller)
- Concurrent events: Neither VA < VB nor VB < VA
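A sketch of vector clocks as plain dictionaries mapping process name to counter, where a missing entry counts as 0. The function names are hypothetical.

```python
def increment(clock: dict, process: str) -> dict:
    """Local event on `process`: bump its own component."""
    updated = dict(clock)
    updated[process] = updated.get(process, 0) + 1
    return updated

def merge(local: dict, received: dict, process: str) -> dict:
    """Receive: take the component-wise max, then count the receive as an event."""
    merged = {p: max(local.get(p, 0), received.get(p, 0)) for p in set(local) | set(received)}
    return increment(merged, process)

def happened_before(a: dict, b: dict) -> bool:
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

def concurrent(a: dict, b: dict) -> bool:
    """Each clock is ahead of the other in at least one component."""
    keys = set(a) | set(b)
    return (any(a.get(k, 0) > b.get(k, 0) for k in keys)
            and any(b.get(k, 0) > a.get(k, 0) for k in keys))

a = increment({}, "P1")  # {"P1": 1}: write on P1
b = increment({}, "P2")  # {"P2": 1}: independent write on P2
c = merge(b, a, "P2")    # {"P1": 1, "P2": 2}: P2 has now seen P1's write
```

Here `a` and `b` are concurrent, which is how a store detects conflicting updates to the same key, while both `a` and `b` happened-before `c`.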
Use Cases
- Lamport clocks: Total ordering of events (distributed snapshots)
- Vector clocks: Conflict detection (Riak, Dynamo)
- Example: Detect concurrent updates to same key
TrueTime (Google Spanner)
- Uses atomic clocks + GPS for global time
- Time is interval (t ± ε) accounting for uncertainty
- Wait out uncertainty before committing ("commit wait", so commit timestamps respect real-time order)
10. Additional Important Topics
Back-of-the-Envelope Calculations
Common Numbers (Latency)
- L1 cache: 0.5 ns
- L2 cache: 7 ns
- RAM: 100 ns
- SSD read: 150 μs
- HDD seek: 10 ms
- Network within datacenter: 0.5 ms
- Round trip CA to Netherlands: 150 ms
Storage Capacity
- 1 KB = 1,000 bytes
- 1 MB = 1,000 KB
- 1 GB = 1,000 MB
- 1 TB = 1,000 GB
- 1 PB = 1,000 TB
Traffic Estimates
- Example: 100M DAU, average 10 requests/day
- QPS = 100M × 10 / 86400 ≈ 11,574 req/s
- Peak QPS ≈ 2-3× average ≈ 30,000 req/s
Storage Estimates
- Example: 1M tweets/day, 280 chars average (assume 1 byte per char, no metadata), 5 years retention
- 280 bytes × 1M × 365 × 5 ≈ 511 GB (round to ~500 GB)
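The traffic and storage estimates can be reproduced with a few lines of arithmetic. The variable names are hypothetical, and the storage figure assumes 1 byte per character with no metadata, indexes, or replication.

```python
SECONDS_PER_DAY = 86_400

# Traffic: 100M daily active users, 10 requests each per day.
dau = 100_000_000
requests_per_user_per_day = 10
average_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY  # about 11,574
peak_qps_low = 2 * average_qps   # about 23,000
peak_qps_high = 3 * average_qps  # about 35,000

# Storage: 1M tweets/day, 280 bytes each, kept for 5 years.
tweets_per_day = 1_000_000
bytes_per_tweet = 280
retention_days = 365 * 5
storage_bytes = tweets_per_day * bytes_per_tweet * retention_days
storage_gb = storage_bytes / 1_000_000_000  # 511.0
```

In an interview it is usually enough to round 86,400 to about 100,000 seconds per day, which gives roughly 10,000 QPS and the same order of magnitude.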
Polling vs Push vs Long Polling
Polling
- Client periodically requests updates
- Pros: Simple, stateless
- Cons: Wasted requests (if no updates), delayed updates
Push (WebSockets, SSE)
- Server pushes updates when available
- Pros: Real-time, efficient
- Cons: Complex, stateful connections
Long Polling
- Client requests, server holds until update available (or timeout)
- Pros: More real-time than polling, better compatibility than WebSockets
- Cons: Still overhead of reconnections
Database Connection Pooling
- Problem: Creating DB connections is expensive (TCP handshake, auth)
- Solution: Pool of reusable connections
- Benefits: Faster response, controlled max connections
- Configuration: Min/max pool size, connection timeout, idle timeout
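A minimal pool sketch built on a thread-safe queue. The `ConnectionPool` class and the `connect` factory are hypothetical; real pools (for example those shipped with database drivers) also handle idle timeouts, health checks, and broken connections.

```python
import queue
from contextlib import contextmanager

class ConnectionPool:
    def __init__(self, connect, max_size: int = 5, timeout: float = 1.0):
        self._connect = connect  # factory that opens a new connection
        self._timeout = timeout  # how long a caller waits for a free slot
        self._slots = queue.LifoQueue(maxsize=max_size)
        for _ in range(max_size):
            self._slots.put(None)  # empty slot: connection is opened lazily

    @contextmanager
    def connection(self):
        # Raises queue.Empty if all connections stay busy for `timeout` seconds.
        slot = self._slots.get(timeout=self._timeout)
        try:
            conn = slot if slot is not None else self._connect()
        except Exception:
            self._slots.put(None)  # give the slot back if connecting failed
            raise
        try:
            yield conn
        finally:
            self._slots.put(conn)  # return the connection for reuse

opened = []

def fake_connect():
    """Stand-in for an expensive TCP handshake plus authentication."""
    opened.append(object())
    return opened[-1]

pool = ConnectionPool(fake_connect, max_size=2)
with pool.connection() as first:
    pass
with pool.connection() as second:
    pass
# Only one real connection was opened; the second request reused it.
```

The LIFO queue hands out the most recently returned connection first, so a lightly loaded service keeps a small number of warm connections instead of cycling through all of them.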
Partitioning vs Sharding
- Often used interchangeably
- Partitioning: Splitting data (can be on same server)
- Horizontal: Split rows (same schema)
- Vertical: Split columns (different tables)
- Sharding: Horizontal partitioning across multiple servers
Webhooks
- Concept: Server calls client URL when event occurs
- Use cases: Payment notifications, GitHub push events
- Considerations:
- Retry logic (client might be down)
- Idempotency (duplicates possible)
- Security (validate sender, HTTPS)
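Two of the considerations above, sender validation and idempotency, can be sketched together. The shared secret, the signature scheme (hex HMAC-SHA256 of the raw body), and the event ID field are assumptions; each webhook provider documents its own header names and signing format.

```python
import hashlib
import hmac

def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def handle_webhook(secret: bytes, body: bytes, signature: str,
                   event_id: str, seen_event_ids: set) -> str:
    expected = sign_payload(secret, body)
    # Constant-time comparison; reject anything not signed with the shared secret.
    if not hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8")):
        return "rejected"
    # Retries can deliver the same event twice; process each ID only once.
    if event_id in seen_event_ids:
        return "duplicate"
    seen_event_ids.add(event_id)
    return "processed"

secret = b"whsec_demo"  # hypothetical shared secret
body = b'{"event": "payment.succeeded"}'
seen = set()

first = handle_webhook(secret, body, sign_payload(secret, body), "evt_1", seen)
retry = handle_webhook(secret, body, sign_payload(secret, body), "evt_1", seen)
forged = handle_webhook(secret, body, "deadbeef", "evt_2", seen)
```

In production the set of seen IDs would live in a durable store with a TTL, since an in-memory set is lost on restart.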
Distributed Hash Table (DHT)
- Use case: P2P systems (BitTorrent, blockchain)
- Concept: Hash key maps to node responsible for storing it
- Consistent hashing: Add/remove nodes with minimal reshuffling
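A sketch of a consistent hash ring with virtual nodes. The `HashRing` class, node names, and the choice of 100 virtual nodes per server are assumptions for illustration.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each node owns many points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
# Every moved key now lives on node-d; keys never shuffle between old nodes.
```

With naive `hash(key) % N` placement, going from 3 to 4 nodes would remap most keys; on the ring, only the keys that fall into the new node's ranges move.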
Interview Preparation Tips
How to Approach System Design Interviews
1. Clarify Requirements (5 min)
- Functional: What features? (read/write, search, notifications)
- Non-functional: Scale (users, requests/sec), latency, availability
- Constraints: Budget, timeline, existing infrastructure
2. Back-of-Envelope Estimates (5 min)
- Calculate QPS, storage, bandwidth
- Determine scale tier (thousands vs millions vs billions)
3. High-Level Design (10-15 min)
- Draw main components (client, load balancer, servers, databases, cache)
- API design (key endpoints, request/response)
- Data model (tables, relationships)
4. Deep Dive (15-20 min)
- Interviewer will probe specific areas
- Be ready to discuss: scaling, failures, bottlenecks, trade-offs
- Common deep dives: Database choice, caching strategy, consistency model
5. Wrap Up (5 min)
- Monitoring, metrics, alerts
- Potential improvements, future scaling
Common Mistakes to Avoid
- Jumping to solution without clarifying requirements
- Over-engineering (don't add Kafka if simple queue suffices)
- Ignoring trade-offs (every decision has pros/cons)
- Not considering failures (what if DB goes down?)
- Forgetting about monitoring/observability
Key Trade-offs to Discuss
- Consistency vs Availability (CAP theorem)
- Latency vs Throughput (batch processing vs real-time)
- Normalization vs Denormalization (storage vs query speed)
- SQL vs NoSQL (ACID vs scalability)
- Monolith vs Microservices (simplicity vs scalability)
- Synchronous vs Asynchronous (simplicity vs performance)
Practice Questions
- Design Twitter/Instagram
- Design URL shortener
- Design video streaming (YouTube/Netflix)
- Design messaging system (WhatsApp)
- Design ride-sharing (Uber)
- Design newsfeed
- Design web crawler
- Design search autocomplete
- Design rate limiter
- Design distributed cache
- Design key-value store
- Design notification system
Quick Reference
When to Use SQL vs NoSQL
Use SQL when:
- Need ACID transactions
- Complex queries with joins
- Structured, relational data
- Data integrity critical
Use NoSQL when:
- Massive scale (horizontal scaling)
- Flexible schema
- High write throughput
- Eventual consistency acceptable
Caching Decision Tree
- Frequently accessed data? → Yes → Cache it
- Read-heavy or write-heavy?
- Read-heavy → Cache-aside
- Write-heavy → Write-behind (write-back); use write-through only if the cache must never be stale
- Consistency critical?
- Yes → Write-through
- No → Cache-aside with TTL
Database Replication Strategy
- Read-heavy → Master-slave
- Write-heavy + global → Master-master
- Strong consistency → Master-slave with sync replication
- High availability → Master-master or multi-region
Message Queue vs Database
- Use Queue when: Async processing, decoupling, load leveling
- Use Database when: Need to query data, ACID required, persistent storage
This guide covers the essential system design topics for interviews. Remember: there's rarely one "correct" answer in system design. Focus on demonstrating your thought process, understanding trade-offs, and designing for the stated requirements.